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1  Abstract 

We  propose  a  data  migration  mechanism  that  allows  an  explicit  and  controlled  mapping  of  data 
to  memory.  While  read  or  write  copies  of  each  data  element  can  be  assigned  to  any  processor’s 
memory,  longer  term  storage  of  each  data  element  is  assigned  to  a  specific  location  in  the  memory 
of  a  particular  processor.  Our  proposed  integration  of  a  data  migration  scheme  with  a  compiler 
is  able  to  eliminate  the  migration  of  unneeded  data  that  can  occur  in  multiprocessor  paging  or 
caching.  The  overhead  of  adjudicating  multiple  concurrent  writes  to  the  same  page  or  cache  line 
is  also  eliminated.  We  present  data  that  suggests  that  the  scheme  we  suggest  may  be  a  practical 
method  for  efficiently  supporting  data  migration. 


‘This  work  was  supported  by  the  U.S.  Office  of  Naval  Research  under  Grant  N00014-86-K-0310,  and  under 
NASA  contract  INASl-18605  while  the  authors  were  in  residence  at  ICASE,  NASA  Langley  Research  Center. 
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2  Introduction 


It  is  well  known  that  data  distribution  and  load  balance  play  critical  roles  in  determining  the 
performance  one  can  expect  to  obtain  from  distributed  machines.  Data  must  be  moved  from 
processor  to  processor  in  response  to  computational  demands.  One  way  of  supporting  data 
migration  is  to  explicitly  designate  blocks  of  data  that  are  to  be  prefetched  into  the  memory  of 
a  given  processor  and  to  copy  the  data  into  customized  data  structures.  Programs  are  written 
on  each  processor  with  intimate  knowledge  of  the  format  used  to  store  off-processor  data.  For 
some  problems,  this  approach  to  data  distribution  can  be  extremely  efficient.  Programming  in 
this  manner  can  be  very  time  consuming,  and  can  lead  to  programs  that  are  difficult  to  debug. 

Mechanisms  have  been  proposed  to  allow  data  to  migrate  in  an  automatic  fashion.  The 
physical  memory  of  a  multiprocessor  is  viewed  as  a  single  logical  memory.  Data  migrates  to  the 
processors  that  refer  to  particulai  logI>.a.l  memory  addresses.  The  following  methods  have  been 
proposed  to  support  data  migration  [1],  [4],  [3],  [5] 

•  Multiprocessor  paging;  A  logically  shared  memory  is  divided  into  pages  which  are  contigu¬ 
ous,  equal  sized  ranges.  Processors  store  copies  of  required  pages  in  their  local  memories. 
A  page  table  is  used  to  find  the  page  corresponding  to  a  given  address  in  logical  memory. 

•  Multiprocessor  Caching:  Each  processor  stores  copies  of  the  contents  of  address  ranges  in 
a  logically  shared  memory.  A  subset  of  address  bits  are  used  to  determine  the  location  of 
data  in  physical  memory. 

* 

Before  describing  our  methods,  we  will  outline  some  of  the  well  known  shortcomings  of  each 
of  these  data  migration  mechanisms.  One  of  the  principal  difficulties  with  multiprocessor  paging 
is  the  problem  of  false  sharing.  In  a  given  portion  of  code,  most  locations  in  logical  memory 
may  not  be  accessed  by  multiple  processors.  Since  pages  consist  of  ranges  of  contiguous  memory 
locations,  different  portions  of  a  given  page  may  be  accessed  by  different  processors.  At  best, 
false  sharing  will  cause  processors  to  waste  physical  memory  to  store  data  that  will  not  be 
used.  False  sharing  has  the  potential  for  causing  particularly  severe  performance  degradation 
if  multiple  processors  attempt  to  concurrenth  e  to  different  memory  locations  on  a  given 
page.  Decreasing  page  size  will  tend  to  reduce  1;  ■  aring.  However  when  page  size  is  reduced, 

page  table  storage  requirements  and  communicatio..s  latency  overheads  increase. 

In  multiprocessor  caching,  we  may  also  see  false  sharing  when  a  large  cache  line  size  is 
chosen  or  alternately  experience  significant  communications  latency  effects  when  small  cache 
lines  are  employed.  Furthermore,  since  the  cost  of  obtaining  off-processor  data  may  be  very 
high  compared  to  the  cost  of  fetching  data  from  local  memory  or  cache  ,  we  want  to  maintain 
an  extremely  high  hit  ratio.  It  is  well  known  that  it  is  easy  to  find  patterns  of  data  access  that 
make  very  poor  use  of  cache  memory.  It  may  not  be  possible  to  maintain  an  extremely  high  hit 
ratio  in  a  data  cache. 

Finally,  in  both  caching  and  paging  schemes,  maintaining  coherency  in  large  scale  multipro¬ 
cessors  may  be  associated  with  very  high  overheads. 
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2.1  Overview  of  Hashed  Cache  Data  Migration  Scheme 

We  support  distributed  data  structures  in  a  way  that  allows  an  explicit  and  controlled  mapping 
of  data  to  memory.  While  read  or  write  copies  of  each  data  element  can  be  assigned  to  any 
processor’s  memory,  longer  term  storage  of  each  data  element  is  assigned  to  a  specific  location 
in  the  memory  of  a  particular  processor.  A  processor  needing  to  read  or  write  an  array  element 
gets  a  copy  of  that  element.  We  constrain  the  form  of  programs  so  that  only  parallel  loops  or 
sequential  code  can  be  specified.  Each  data  element  can  be  written  to  by  at  most  one  processor 
during  the  course  of  a  single  set  of  (nested)  parallel  loops.  This  constraint  eliminates  the  need 
for  hardware  coherency  support.  At  the  end  of  a  set  of  nested  parallel  loops,  all  modified  data 
is  copied  back  to  its  home  location. 

A  loop  is  transformed  into  two  parts  by  a  compiler.  One  part  is  called  an  inspector,  the 
other  is  called  an  executor.  The  inspector  is  responsible  for  determining  what  data  elements  are 
required  by  a  loop,  the  executor  carries  out  the  actual  computations. 

In  a  distributed  memory  machine,  execution  time  procedures  cany  out  the  actual  fetching 
of  data.  Local  copies  of  data  ar:  stored  in  a  hash  tabic.  Access  to  the  table  must  be  very 
fast.  Location  the  in  hash  table  is  determined  by  taking  low  order  bits  of  a  quantity  that  is 
analogous  to  an  address  in  logical  memory.  Unlike  a  traditional  cache,  we  cannot  afford  to  allow 
data  elements  to  be  thrown  away  just  because  too  many  required  addresses  have  the  same  low 
order  bits.  We  instead  use  a  linked  list  when  more  than  a  single  data  element  hashes  to  a  given 
location. 

3  The  Hashed  Cache  System 

3.1  Support  of  Distributed  Arrays 

We  support  distributed  multidimensional  arrays  in  a  way  that  allows  an  explicit  and  controlled 
mapping  of  data  to  memory.  As  stated  above,  long  term  storage  of  each  data  element  is  assigned 
to  a  specific  location  in  the  memory  of  a  particular  processor.  Users  are  able  to  specify  the 
following  attributes  in  their  distributed  array  initializations: 

•  The  topology  of  the  processor  array  on  which  the  data  arrays  are  to  be  embedded 

•  The  dimensionality  of  the  data  arrays 

•  Subset  of  processors  used  to  store  array  elements 

•  Mapping  of  array  elements  onto  the  specified  processors 

Once  a  distributed  array  is  initialized,  we  can  use  the  specified  partitioning  information  to 
find,  for  any  distributed  array  element,  a  unique  processor  P  along  with  a  unique  location  in  that 
processor  P’s  storage.  In  each  processor,  contiguous  memory  locations  are  used  to  store  local 
elements  of  a  given  distributed  array.  The  unique  location  in  P’s  storage  can  thus  be  expressed 
as  an  integer  offset  0.  Further  details  of  our  support  for  distributed  arrays  are  beyond  the  scope 
of  this  note  but  can  be  found  in  [2],  [6]. 
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31  do  iter=l,  num 

32  do  i=l,n**2 

33  do  j=0,  m 

34  yCi]  +=  values  [i]  [j]*yold[nbrs[i]  [j]]  ; 
eiid  do 

end  do 

35  do  k=l,n**2 

yoldCk]  =  y[k]; 
end  do 

end  do 


Figure  1:  Sparse  Matrix  Vector  Multiply 

3.2  Details  of  the  Hashed  Cache  System 

The  program  in  Figure  1  w.U  be  used  as  a  running  example  in  this  discussion.  This  program 
performs  a  sequence  of  matrix  vector  multiplications.  In  order  to  compute  y  [i]  at  each  iteration, 
we  need  yold[nbrs[i]  [j]].  Both  y  and  yold  are  to  be  defined  as  distributed  arrays.  When 
the  loop  is  distributed,  loop  iterations  may  be  assigned  to  processors  in  an  arbitrary  fashion. 
Consequently  long  term  storage  of  elements  of  y  or  yold  may  not  be  assigned  to  the  processors 
that  execute  code  referring  to  those  elements. 

The  global  arrays  are  initialized  at  the  start  of  the  program.  We  proceed  to  describe  the 
primitives  that  support  inspector  and  executor  phases  of  the  hashed-cache  system.  To  best 
understand  the  details  of  the  inspector  and  executor  phases  we  describe  them  in  the  context  of 
the  example  presented  in  Figure  1. 

Firstly,  there  are  3  different  kind  of  references  to  global  arrays. 

1.  Local:  The  address  of  the  reference  corresponds  to  the  local  memory  of  the  processor.  The 
reference  may  be  a  read  or  a  write. 

2.  Non-Local  Read:  The  address  of  the  reference  corresponds  to  the  local  memory  of  some 
other  processor.  The  element  is  only  read. 

3.  Non-Local  Write:  The  address  of  the  reference  corresponds  to  the  local  memory  of  some 
other  processor;  the  element  is  written  to. 

3.3  The  Inspector  Phase 

Figure  2  depicts  the  psuedo-code  of  the  inspector  phase  for  the  sparse  matrix  vector  multiply. 
During  the  inspector  phase  we  go  through  the  inner-loop  once  to  check  for  hn  al  and  non-local 
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global  array  accesses.  If  an  array  reference  is  local  we  do  nothing.  However,  if  it  is  a  non-local 
reference  to  a  global  array,  we  compute  the  processor  on  which  the  element  resides  and  its  offset. 
We  need  to  store  this  information  in  such  a  way  that  accessing  it  is  efficient.  This  is  achieved 
by  using  a  hashed  cache  scheme. 

Initially,  we  allocate  a  certain  amount  of  memory  for  a  cache.  We  partition  the  cache  into 
blocks,  one  for  each  globally  defined  distributed  array.  Each  block  is  treated  as  a  separate  hash 
table.  The  location  of  a  non-local  distributed  ?rray  element  is  determined  by  a  hash  function. 
Currently  ,  we  use  a  hashing  function  that  simply  masks  the  lower  k  bits  of  the  key  where  k 
depends  on  the  size  of  the  hash  table.  The  key  is  formed  by  concatenating  the  processor-offset 
pair,  (P,  O),  that  corresponds  to  a  distributed  array  reference.  Each  entry  in  the  hash  table 
consists  of  the  following: 

1.  a  reference  to  the  non-local  data  item,  i.e.,  the  data  item’s  processor-offset  pair, 

2.  whether  the  item  is  to  be  read  (read  flag), 

3.  whether  the  item  is  to  be  written  (write  flag), 

4.  the  data  value  itself. 


If  the  data  item  is  a  non-local  read  reference  R,  it  is  processed  by  the  process-global-read() 
routine.  The  routine  is  described  as  follows: 

process-global-read() 

1.  Search  for  the  reference  R  in  the  hashed-cache. 

2.  If  R  exists  and  the  read  flag  is  set,  do  nothing. 

3.  If  R  exists  and  the  read  flag  is  not  set,  set  read  flag. 

4.  If  R  is  not  found  the  in  hashed  cache,  create  an  entry  with  read  flag  set  and  enter  it  in  the 
hashed-cache. 

5.  In  the  latter  two  situations,  increment  a  count  variable  that  contains  the  number  of  non¬ 
local  elements  to  be  gathered  from  the  processor  P  on  which  this  element  resides.  The 
offset  of  this  element  is  written  to  a  list  containing  the  offsets  of  all  the  elements  to  be 
gathered  from  P. 

Non-local  array  references  R  that  are  written  to,  are  processed  by  the  process-global-write () 
routine  described  below: 

process-global-  write  ( ) 

1.  Search  for  the  reference  R  in  the  hashed-cache. 
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Loop  over  local  iterations  i  assigned  to  P 
do  j  =  O.m 

Compute  processor,  offset  pair  for  element  of  yold 

If  yold  reference  is  to  non-local  array  element, 
process-global-readO 

end  do 

End  loop  over  local  iterations 

Loop  over  local  iterations  k  assigned  to  P 

If  yold  reference  is  to  non-local  array  element, 
process-global-write  0 

End  loop  over  local  iterations 


Figure  2:  Inspector;  Sparse  Matrix  Vector  Multiply 


2.  If  R  exists  and  the  write  flag  is  set,  do  nothing. 

3.  If  R  exists  and  the  write  flag  is  not  set,  set  write  flag. 

4.  If  R  is  not  found  in  hashed  cache,  create  an  entry  with  write  flag  set  and  enter  it  in  the 
hashed- cache. 

5.  In  the  latter  two  situations  increment  a  count  variable  containing  the  number  of  non-local 
elements  to  be  scattered  to  P.  The  offset  of  R  is  written  to  a  list  containing  the  offsets  of 
all  the  elements  to  be  scattered  to  P. 

At  the  completion  of  the  inspector  pha,se  we  precompute  the  communication  pattern  required 
to  efficiently  gather  or  scatter  all  relevant  non-local  data  referenced  in  the  loop.  This  requires  a 
global  communication  phase  in  which  all  processors  participate.  For  a  detailed  description,  see 
[6],  [2]. 
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3.4  The  Executor  Phase 


Figure  3  depicts  the  pseudo-code  of  the  eiecutor  phase  for  the  sparse  matrix  vector  multiply. 

The  non-local  data  required  by  the  inner  loop  is  first  obtained  from  other  processors  and 
stored  in  the  hashed-cache  by  the  process-gather-data  routine.  We  now  proceed  to  execute  the 
doall  loop. 

During  the  execution  of  the  inner-loop  we  check  each  distributed  array  reference  to  decide 
whether  it  resides  in  the  local  array  or  not.  If  It  does,  we  compute  the  offset  of  the  element  in 
the  local  array  and  fetch  the  data  item  from  the  appropriate  memory  location.  If  it  does  not,  we 
fetch  it  from  the  hashed  cache.  If  the  array  reference  occurs  on  the  left  hand  and  it  is  non-local, 
we  enter  the  new  value  in  the  hashed  cache.  At  the  end  of  the  execution  of  the  inner-loop,  each 
processor  calls  the  process-scatter-data()  routine.  This  routine  goes  through  the  list  of  non-local 
offsets  of  elements  to  be  scattered,  searches  for  these  elements  in  the  hashed  cache  and  writes 
the  value  to  a  list  containing  the  new  values  to  be  written  to  the  distributed  memory.  The  data 
is  then  scattered  to  the  distributed  memory. 

The  operations  for  computing  processor  number  and  offset  are  computationally  very  cheap 
since  we  assume  the  distributed  array  may  be  partitioned  in  a  block  or  block  wrap  fashion.  The 
size  of  each  block  is  a  power  of  2  and  thus  we  need  to  perform  simple  integer  operations  such  as 
shifts  to  compute  the  offset  and  processor  number  of  a  distributed  array  element. 


4  Experimental  Results 

The  program  depicted  in  Figure  1  exhibits  greatly  varying  patterns  of  locality  depending  on  how 
loop  iterations  are  assigned  to  processors  and  depending  on  the  contents  of  the  integer  array 
T'.t“ger  ran  b*'  vie-.v'''!  '■a  a  rep’-'”’er*^?tion  of  a  sparse  matrix.  We  used  a  synthetic 

workload  to  generate  a  number  of  sparse  matrices  with  differing  dependency  patterns.  A  square 
mesh  in  which  each  point  was  linked  to  four  nearest  neighbors  was  incrementally  distorted. 
Random  edges  were  introduced  subject  to  the  constraint  that  in  the  new  mesh,  each  point  still 
required  information  from  four  other  mesh  points. 

Our  workload  generator  makes  the  followur.g  assumptions- 

1.  The  problem  domain  consists  of  a  2-dimensional  mesh  of  points  which  are  numbered  using 
their  natural  ordering; 

2.  Each  point  is  initially  connected  to  its  four  nearest  neighbors 

3.  Each  link  produced  in  the  above  step  is  examined,  with  probability  Pr  the  link  is  replaced 
by  a  link  to  a  randomly  chosen  point. 

Once  generated,  this  connectivity  information  is  stored  in  integer  array  nbrs. 

To  obtain  an  experimental  estimate  of  the  efficiency  of  the  executor  on  the  Intel  iPSC/2,  we 
carried  out  a  sequence  of  sparse  matrix- vector  multiplications  using  a  128*128  matrix  generated 
from  a  square  mesh  using  the  w'orkload  generator  described  above. 


do  iter=l,  num 


process-gather-dataO  -  obtain  needed  yold  values  from 

other  processors, 

put  in  hashed  cache 

Loop  over  local  iterations  k  assigned  to  P 
do  j  =  0,m 

Perform  calculation  reading  yold  values  or 
writing  y  values  using  local  memory  or 
hashed  cache  as  is  appropriate. 

end  do 

End  loop  over  local  iterations 

Loop  over  local  iterations  k  assigned  to  P 

Perform  assignment  reading  y  values  or 
writing  yold  values  using  local  memory  or 
hashed  cache  as  is  appropriate. 

End  loop  over  local  iterations 

process-scatter-dataO  -  scatter  yold  values  from 
hashed  cache  to  appropriate 
processors 

end  do 


Figure  3:  Executor:  Sparse  Matrix  Vector  Multiply 
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V’e  define 


•  p  =  Total  Number  of  Processors 

•  Blocks ize  —  (128  X  128)/p 

•  ArraySize  =  128  X  128 

We  partitioned  yold  in  the  following  ways: 


1.  partition  the  array  in  contiguous  blocks  of  size  BlockSize,  i.e.  processor  i  is  assigned  indices 
zx  BlockSize  through  (  z  +  l)x  BlockSize -1 

2.  partition  the  indices  in  an  interleaved  fashion  i.e.  processor  i  isassignedindicesz,t  +  p,z  + 
2p,  ...,z+  [BlockSize  -  1)  xp. 

We  first  present  the  results  obtained  by  partitioning  yold  in  contiguous  blocks.  Table  1 
depicts  the  time  required  to  carry  out  the  inspector  and  executor  loops  along  with  the  optimal 
time.  We  define  the  optimal  time  as  the  sequential  time  divided  by  the  number  of  processors. 
The  inspector  took  a  time  roughly  equal  that  required  by  one  or  two  optimally  parallelized 
iterations.  Since  the  inner  loops  of  most  scientific  codes  consist  of  many  repetitions  of  loops 
with  invariant  dependency  patterns,  this  inspector  overhead  is  not  expected  to  be  a  serious 
performance  bottleneck  in  many  programs  of  practical  interest. 

We  define  parallel  efficiency  as  Tparaiuil{T,equentiai*  P)  where  T,fq^ential  is  the  time  taken  by 
a  sequential  program  to  run  on  a  single  processor,  P  is  the  number  of  processors  and  TparaiUl  is 
the  time  required  to  run  the  parallelized  program  on  P  processors.  The  parallel  efficiencies  were 
0.76,  0.73,  0.67  and  0.56  for  problems  run  on  4,  8,  16  and  32  processors  respectively.  We  obtained 
relatively  high  efficiencies  because  most  of  the  required  data  resided  in  the  local  memory  of  the 
processor.  We  also  ran  the  sparse  matrix  vector  multiply  on  32  processors  with  probabilities  Pr 
equal  to  0.2  and  0.4  that  an  edge  is  randomly  displaced.  As  we  increased  Pr,  we  encountered 
more  non-local  references.  Efficiencies  dropped  from  0.56  to  0.30  when  the  probability  was  0.2 
and  to  0.21  when  the  probability  was  0.4. 

We  next  present  results  obtained  using  an  interleaved  partition  of  yold.  When  an  interleaved 
partition  is  employed,  most  of  the  data  required  by  each  processor  is  non-local.  We  need  to 
fetch  large  amounts  of  data  from  the  hashed  cache.  Moreover,  there  are  also  non-local  write 
data  accesses  to  yold.  Thus  wo  have  to  write  the  new  values  to  the  distributed  memory.  The 
parallel  efficiencies  obtained  using  an  interleaved  partition  of  yold  are  depicted  in  Table  2.  As 
we  increased  the  probability  Pr,  the  number  of  edges  displaced  randomly  increases.  Due  to  the 
nature  of  the  array  partitioning,  it  is  likely  that  the  main  effect  of  of  increaseing  Pr  is  to  increase 
the  number  of  processors  from  which  data  needs  to  be  fetched.  We  see  a  decrease  in  efficiency 
as  we  increase  the  probability  of  randomly  displacing  an  edge. 

Single  processor  timings  of  an  optimized  version  were  compared  with  the  parallel  code  run 
on  one  processor.  The  sequential  code  required  Tsequentiai  —  64.4  milliseconds  and  the  one  pro¬ 
cessor  parallel  code  required  Tparaiui  =  81.4  milliseconds.  The  overhead  due  to  the  executor 
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Table  1:  Matrix- Vector  Multiply  -  Blocked  Partitioning 


Processors 

Inspector 

time 

(ms) 

Executor 

time 

(ms) 

Optimal 

time 

(ms) 

4 

11.8 

21.3 

16.1 

8 

7.3 

11.1 

8.05 

16 

4.8 

6.0 

4.03 

32 

4.3 

3.6 

2.01 

Table  2:  Matrix- Vector  Multiply  -  Interleaved  Partitioning 


Processors 

Pr  =  0.0 
efficiency 

Pr  =  0.2 
efficiency 

Pr  =  0.4 
efficiency 

0.29 

0.18 

0.19 

0.11 

■H 

is  approximately  26  percent.  We  expect  substantially  lower  overheads  on  RISC  based  multi¬ 
processors  both  because  of  the  high  prevalence  of  shift  operations  in  hashed  cache  calculations 
and  because  of  the  potential  for  being  able  to  concurrently  schedule  executor  floating  point  and 
integer  operations. 

5  Conclusion 

We  propose  a  data  migration  mechanism  that  allows  an  explicit  and  controlled  mapping  of  data 
to  memory.  While  read  or  write  copies  of  each  data  element  can  be  assigned  to  any  processor’s 
memiory,  longer  term  storage  of  each  data  element  is  assigned  to  a  specific  location  in  the  memory 
of  a  particular  processor.  Our  proposed  integration  of  a  data  migration  scheme  with  a  compiler 
is  able  to  eliminate  the  migration  of  unneeded  data  that  can  occur  in  multiprocessor  paging  or 
caching.  The  overhead  of  adjudicating  multiple  concurrent  writes  to  the  same  page  or  cache  line 
is  also  eliminated.  We  present  data  that  suggests  that  the  scheme  we  suggest  may  be  a  practical 
method  for  efficiently  supporting  data  migration. 
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